Fit the P/NBD Model to External Data
1 Load Transactional Datasets
We first want to load the real-world transactional dataset.
1.1 Load Pre-processed Transactional Data
customer_cohortdata_tbl <- read_rds("data/customer_cohort_tbl.rds")
customer_cohortdata_tbl %>% glimpse()## Rows: 5,852
## Columns: 5
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351", …
## $ cohort_qtr <yearqtr> 2010 Q1, 2010 Q4, 2010 Q3, 2010 Q2, 2011 Q1, 2010 …
## $ cohort_ym <chr> "2010 03", "2010 10", "2010 09", "2010 04", "2011 02",…
## $ first_tnx_date <date> 2010-03-02, 2010-10-31, 2010-09-27, 2010-04-29, 2011-…
## $ total_tnx_count <int> 3, 8, 5, 3, 1, 1, 9, 2, 1, 2, 6, 2, 5, 10, 6, 4, 10, 2…
We also want to load the raw transaction data as we want to transform the data into a form we now use.
retail_transaction_data_tbl <- read_rds("data/retail_data_cleaned_tbl.rds")
retail_transaction_data_tbl %>% glimpse()## Rows: 1,021,424
## Columns: 23
## $ row_id <chr> "ROW0000001", "ROW0000002", "ROW0000003", "ROW000000…
## $ excel_sheet <chr> "Year 2009-2010", "Year 2009-2010", "Year 2009-2010"…
## $ invoice_id <chr> "489434", "489434", "489434", "489434", "489434", "4…
## $ stock_code <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ description <chr> "15CM CHRISTMAS GLASS BALL 20 LIGHTS", "PINK CHERRY …
## $ quantity <dbl> 12, 12, 12, 48, 24, 24, 24, 10, 12, 12, 24, 12, 10, …
## $ invoice_date <date> 2009-12-01, 2009-12-01, 2009-12-01, 2009-12-01, 200…
## $ price <dbl> 6.95, 6.75, 6.75, 2.10, 1.25, 1.65, 1.25, 5.95, 2.55…
## $ customer_id <chr> "13085", "13085", "13085", "13085", "13085", "13085"…
## $ country <chr> "United Kingdom", "United Kingdom", "United Kingdom"…
## $ stock_code_upr <chr> "85048", "79323P", "79323W", "22041", "21232", "2206…
## $ cancellation <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
## $ invoice_dttm <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:00, 2009-12-0…
## $ invoice_month <chr> "December", "December", "December", "December", "Dec…
## $ invoice_dow <chr> "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday…
## $ invoice_dom <chr> "01", "01", "01", "01", "01", "01", "01", "01", "01"…
## $ invoice_hour <chr> "07", "07", "07", "07", "07", "07", "07", "07", "07"…
## $ invoice_minute <chr> "45", "45", "45", "45", "45", "45", "45", "45", "45"…
## $ invoice_woy <chr> "49", "49", "49", "49", "49", "49", "49", "49", "49"…
## $ invoice_ym <chr> "200912", "200912", "200912", "200912", "200912", "2…
## $ stock_value <dbl> 83.40, 81.00, 81.00, 100.80, 30.00, 39.60, 30.00, 59…
## $ invoice_monthprop <dbl> 0.04347826, 0.04347826, 0.04347826, 0.04347826, 0.04…
## $ exclude <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FAL…
We need to aggregate this data up into a form to match our synthetic data, so
we aggregate transactions by invoice_id.
customer_transactions_tbl <- retail_transaction_data_tbl %>%
drop_na(customer_id) %>%
filter(exclude = TRUE) %>%
group_by(tnx_timestamp = invoice_dttm, customer_id, invoice_id) %>%
summarise(
.groups = "drop",
total_spend = sum(stock_value)
) %>%
filter(total_spend > 0) %>%
arrange(tnx_timestamp, customer_id)
customer_transactions_tbl %>% glimpse()## Rows: 37,031
## Columns: 4
## $ tnx_timestamp <dttm> 2009-12-01 07:45:00, 2009-12-01 07:45:59, 2009-12-01 09…
## $ customer_id <chr> "13085", "13085", "13078", "15362", "18102", "12682", "1…
## $ invoice_id <chr> "489434", "489435", "489436", "489437", "489438", "48943…
## $ total_spend <dbl> 505.30, 145.80, 630.33, 310.75, 2286.24, 426.30, 50.40, …
We re-produce the visualisation of the transaction times we used in previous workbooks.
plot_tbl <- customer_transactions_tbl %>%
group_nest(customer_id, .key = "cust_data") %>%
filter(map_int(cust_data, nrow) > 3) %>%
slice_sample(n = 30) %>%
unnest(cust_data)
ggplot(plot_tbl, aes(x = tnx_timestamp, y = customer_id)) +
geom_line() +
geom_point() +
labs(
x = "Date",
y = "Customer ID",
title = "Visualisation of Customer Transaction Times"
) +
theme(axis.text.y = element_text(size = 10))2 Fit the Fixed Prior P/NBD Model
stan_modeldir <- "stan_models"
stan_codedir <- "stan_code"We first need to construct our fitted dataset from this external data.
In terms of choosing a cut-off point, we will consider all transactions up to and including March 31, 2011.
btyd_fitdata_tbl <- customer_transactions_tbl %>%
calculate_transaction_cbs_data(last_date = as.POSIXct("2011-04-01"))
btyd_fitdata_tbl %>% glimpse()## Rows: 4,716
## Columns: 6
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351", "…
## $ first_tnx_date <dttm> 2009-12-14 08:34:00, 2010-10-31 14:19:59, 2010-09-27 1…
## $ last_tnx_date <dttm> 2011-01-18 10:00:59, 2011-01-26 14:29:59, 2011-01-25 1…
## $ x <dbl> 11, 2, 2, 2, 0, 0, 6, 0, 0, 3, 1, 2, 7, 4, 3, 1, 1, 0, …
## $ t_x <dbl> 57.151488095, 12.429563492, 17.117361111, 25.970535714,…
## $ T_cal <dbl> 67.520437, 21.628968, 26.482242, 48.063492, 8.190377, 1…
We also want to construct some summary statistics for the data after that.
btyd_obs_stats_tbl <- customer_transactions_tbl %>%
filter(
tnx_timestamp >= as.POSIXct("2011-04-01")
) %>%
group_by(customer_id) %>%
summarise(
.groups = "drop",
tnx_count = n(),
first_tnx = min(tnx_timestamp),
last_tnx = max(tnx_timestamp)
)
btyd_obs_stats_tbl %>% glimpse()## Rows: 3,849
## Columns: 4
## $ customer_id <chr> "12347", "12348", "12349", "12352", "12353", "12354", "123…
## $ tnx_count <int> 5, 2, 1, 3, 1, 1, 1, 2, 1, 2, 2, 3, 9, 2, 4, 1, 1, 2, 2, 1…
## $ first_tnx <dttm> 2011-04-07 10:43:00, 2011-04-05 10:47:00, 2011-11-21 09:5…
## $ last_tnx <dttm> 2011-12-07 15:51:59, 2011-09-25 13:13:00, 2011-11-21 09:5…
We now compile this model using CmdStanR.
pnbd_extdata_fixed_stanmodel <- cmdstan_model(
"stan_code/pnbd_fixed.stan",
include_paths = stan_codedir,
pedantic = TRUE,
dir = stan_modeldir
)2.1 Fit the Model
We then use this compiled model with our data to produce a fit of the data.
stan_modelname <- "pnbd_extdata_fixed"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- btyd_fitdata_tbl %>%
select(customer_id, x, t_x, T_cal) %>%
compose_data(
lambda_mn = 0.25,
lambda_cv = 0.60,
mu_mn = 0.10,
mu_cv = 0.60,
)
pnbd_extdata_fixed_stanfit <- pnbd_extdata_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 200,
iter_sampling = 50,
seed = 4201,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 Iteration: 1 / 250 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 250 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 250 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 250 [ 0%] (Warmup)
## Chain 1 Iteration: 100 / 250 [ 40%] (Warmup)
## Chain 2 Iteration: 100 / 250 [ 40%] (Warmup)
## Chain 3 Iteration: 100 / 250 [ 40%] (Warmup)
## Chain 4 Iteration: 100 / 250 [ 40%] (Warmup)
## Chain 2 Iteration: 200 / 250 [ 80%] (Warmup)
## Chain 2 Iteration: 201 / 250 [ 80%] (Sampling)
## Chain 1 Iteration: 200 / 250 [ 80%] (Warmup)
## Chain 1 Iteration: 201 / 250 [ 80%] (Sampling)
## Chain 3 Iteration: 200 / 250 [ 80%] (Warmup)
## Chain 3 Iteration: 201 / 250 [ 80%] (Sampling)
## Chain 4 Iteration: 200 / 250 [ 80%] (Warmup)
## Chain 4 Iteration: 201 / 250 [ 80%] (Sampling)
## Chain 2 Iteration: 250 / 250 [100%] (Sampling)
## Chain 1 Iteration: 250 / 250 [100%] (Sampling)
## Chain 2 finished in 29.0 seconds.
## Chain 1 finished in 29.3 seconds.
## Chain 3 Iteration: 250 / 250 [100%] (Sampling)
## Chain 3 finished in 30.9 seconds.
## Chain 4 Iteration: 250 / 250 [100%] (Sampling)
## Chain 4 finished in 31.8 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 30.2 seconds.
## Total execution time: 32.1 seconds.
pnbd_extdata_fixed_stanfit$summary()## # A tibble: 14,149 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -1.36e+5 -1.36e+5 69.4 68.9 -1.36e+5 -1.36e+5 1.02 91.2
## 2 lambda[1] 1.90e-1 1.86e-1 0.0473 0.0508 1.18e-1 2.74e-1 1.00 285.
## 3 lambda[2] 1.64e-1 1.53e-1 0.0761 0.0646 6.21e-2 3.17e-1 1.05 79.7
## 4 lambda[3] 1.46e-1 1.31e-1 0.0633 0.0579 5.83e-2 2.61e-1 0.996 303.
## 5 lambda[4] 1.12e-1 1.08e-1 0.0531 0.0540 4.16e-2 2.06e-1 1.05 285.
## 6 lambda[5] 1.88e-1 1.70e-1 0.117 0.111 4.24e-2 3.91e-1 1.01 210.
## 7 lambda[6] 1.85e-1 1.58e-1 0.118 0.0976 4.11e-2 4.20e-1 1.04 168.
## 8 lambda[7] 2.80e-1 2.71e-1 0.109 0.114 1.08e-1 4.69e-1 1.03 163.
## 9 lambda[8] 2.00e-1 1.77e-1 0.134 0.119 3.69e-2 4.77e-1 1.02 446.
## 10 lambda[9] 1.89e-1 1.58e-1 0.120 0.107 3.34e-2 4.30e-1 1.03 107.
## # … with 14,139 more rows, and 1 more variable: ess_tail <dbl>
We have some basic HMC-based validity statistics we can check.
pnbd_extdata_fixed_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed-1.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed-2.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed-3.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed-4.csvWarning: non-fatal error reading adaptation data
##
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## The following parameters had split R-hat greater than 1.05:
## lambda[425], lambda[826], mu[1791], mu[1795], mu[2331], p_alive[9], p_alive[84], p_alive[110], p_alive[116], p_alive[144], p_alive[145], p_alive[150], p_alive[225], p_alive[227], p_alive[277], p_alive[281], p_alive[364], p_alive[390], p_alive[432], p_alive[443], p_alive[503], p_alive[552], p_alive[580], p_alive[587], p_alive[652], p_alive[687], p_alive[714], p_alive[849], p_alive[863], p_alive[923], p_alive[931], p_alive[1028], p_alive[1029], p_alive[1038], p_alive[1073], p_alive[1107], p_alive[1144], p_alive[1171], p_alive[1330], p_alive[1449], p_alive[1496], p_alive[1498], p_alive[1575], p_alive[1613], p_alive[1646], p_alive[1675], p_alive[1700], p_alive[1717], p_alive[1791], p_alive[1795], p_alive[1846], p_alive[1849], p_alive[1896], p_alive[1908], p_alive[1930], p_alive[2020], p_alive[2041], p_alive[2048], p_alive[2101], p_alive[2116], p_alive[2180], p_alive[2232], p_alive[2261], p_alive[2301], p_alive[2405], p_alive[2508], p_alive[2563], p_alive[2577], p_alive[2677], p_alive[2840], p_alive[2934], p_alive[3023], p_alive[3073], p_alive[3345], p_alive[3397], p_alive[3567], p_alive[3788], p_alive[3817], p_alive[3854], p_alive[3967], p_alive[4010], p_alive[4047], p_alive[4068], p_alive[4090], p_alive[4145], p_alive[4181], p_alive[4290], p_alive[4307], p_alive[4437], p_alive[4509], p_alive[4600]
## Such high values indicate incomplete mixing and biased estimation.
## You should consider regularizating your model with additional prior information or a more effective parameterization.
##
## Processing complete.
2.2 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_extdata_fixed_stanfit$draws(inc_warmup = FALSE) %>%
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
theme(axis.text.x = element_text(size = 10))A common MCMC diagnostic is \(\hat{R}\) - which is a measure of the ‘similarity’ of the chains.
pnbd_extdata_fixed_stanfit %>%
rhat(pars = c("lambda", "mu")) %>%
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the size of the sample from a statistical information point of view.
pnbd_extdata_fixed_stanfit %>%
neff_ratio(pars = c("lambda", "mu")) %>%
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")Finally, we also want to look at autocorrelation in the chains for each parameter.
pnbd_extdata_fixed_stanfit$draws() %>%
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Values")2.3 Validate the Fixed Prior Model
pnbd_extdata_fixed_validation_tbl <- pnbd_extdata_fixed_stanfit %>%
recover_types(btyd_fitdata_tbl) %>%
spread_draws(lambda[customer_id], mu[customer_id], p_alive[customer_id]) %>%
ungroup() %>%
select(
customer_id, draw_id = .draw, post_lambda = lambda, post_mu = mu, p_alive
)
pnbd_extdata_fixed_validation_tbl %>% glimpse()## Rows: 943,200
## Columns: 5
## $ customer_id <chr> "12346", "12346", "12346", "12346", "12346", "12346", "123…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ post_lambda <dbl> 0.231628, 0.177778, 0.194681, 0.206859, 0.205453, 0.184104…
## $ post_mu <dbl> 0.0250698, 0.0426753, 0.0444717, 0.0392961, 0.0384251, 0.0…
## $ p_alive <dbl> 0.434611, 0.368987, 0.329592, 0.346052, 0.354873, 0.396065…
Having constructed our simulations inputs, we now generate our simulations.
precompute_dir <- "precompute/pnbd_extdata_fixed"
precomputed_tbl <- dir_ls(precompute_dir) %>%
enframe(name = NULL, value = "sim_file") %>%
mutate(sim_file = sim_file %>% as.character())
pnbd_extdata_fixed_validsims_lookup_tbl <- pnbd_extdata_fixed_validation_tbl %>%
group_nest(customer_id, .key = "cust_params") %>%
mutate(
sim_file = glue(
"{precompute_dir}/sims_pnbd_extdata_fixed_{customer_id}.rds"
)
)
exec_tbl <- pnbd_extdata_fixed_validsims_lookup_tbl %>%
anti_join(precomputed_tbl, by = "sim_file")
if(exec_tbl %>% nrow() > 0) {
exec_tbl %>%
mutate(
calc_file = future_map2_lgl(
sim_file, cust_params,
run_pnbd_simulations_chunk,
start_dttm = as.POSIXct("2011-04-01"),
end_dttm = as.POSIXct("2011-12-10"),
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = FALSE,
seed = 4202
),
.progress = TRUE
)
)
}
exec_tbl %>% glimpse()## Rows: 0
## Columns: 3
## $ customer_id <chr>
## $ cust_params <list<tibble[,4]>> list()
## $ sim_file <glue>
pnbd_extdata_fixed_validsims_lookup_tbl %>% glimpse()## Rows: 4,716
## Columns: 3
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351", "123…
## $ cust_params <list<tibble[,4]>> [<tbl_df[200 x 4]>], [<tbl_df[200 x 4]>], [<t…
## $ sim_file <glue> "precompute/pnbd_extdata_fixed/sims_pnbd_extdata_fixed_12…
We now load all the simulations into a file.
pnbd_extdata_fixed_validsims_tbl <- pnbd_extdata_fixed_validsims_lookup_tbl %>%
mutate(
data = map(sim_file, ~ .x %>% read_rds() %>% select(draw_id, sim_tnx_count, sim_tnx_last))
) %>%
select(customer_id, sim_file, data) %>%
unnest(data)
pnbd_extdata_fixed_validsims_tbl %>% glimpse()## Rows: 943,200
## Columns: 5
## $ customer_id [3m[38;5;246m<chr>[39m[23m "12346", "12346", "12346", "12346", "12346", "12346", "1…
## $ sim_file [3m[38;5;246m<glue>[39m[23m "precompute/pnbd_extdata_fixed/sims_pnbd_extdata_fixed_…
## $ draw_id [3m[38;5;246m<int>[39m[23m 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ sim_tnx_count [3m[38;5;246m<int>[39m[23m 0, 2, 0, 0, 0, 5, 7, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 2, 1,…
## $ sim_tnx_last [3m[38;5;246m<dttm>[39m[23m NA, 2011-04-17 13:22:25, NA, NA, NA, 2011-09-02 07:12:2…
tnx_data_tbl <- btyd_obs_stats_tbl %>%
semi_join(pnbd_extdata_fixed_validsims_tbl, by = "customer_id")
obs_customer_count <- tnx_data_tbl %>% nrow()
obs_total_tnx_count <- tnx_data_tbl %>% pull(tnx_count) %>% sum()
pnbd_extdata_fixed_tnx_simsumm_tbl <- pnbd_extdata_fixed_validsims_tbl %>%
group_by(draw_id) %>%
summarise(
.groups = "drop",
sim_customer_count = length(sim_tnx_count[sim_tnx_count > 0]),
sim_total_tnx_count = sum(sim_tnx_count)
)
ggplot(pnbd_extdata_fixed_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_customer_count), binwidth = 10) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Simulated Customers With Transactions",
y = "Frequency",
title = "Histogram of Count of Customers Transacted",
subtitle = "Observed Count in Red"
)ggplot(pnbd_extdata_fixed_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_total_tnx_count), binwidth = 50) +
geom_vline(aes(xintercept = obs_total_tnx_count), colour = "red") +
labs(
x = "Simulated Transaction Count",
y = "Frequency",
title = "Histogram of Count of Total Transaction Count",
subtitle = "Observed Count in Red"
)2.4 Write to Disk
pnbd_extdata_fixed_validsims_tbl %>% write_rds("data/pnbd_extdata_fixed_validsims_tbl.rds")3 Fit Alternative Prior Model
3.1 Fit the Model
We then use this compiled model with our data to produce a fit of the data.
stan_modelname <- "pnbd_extdata_fixed2"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- btyd_fitdata_tbl %>%
select(customer_id, x, t_x, T_cal) %>%
compose_data(
lambda_mn = 0.50,
lambda_cv = 1.00,
mu_mn = 0.05,
mu_cv = 1.00,
)
pnbd_extdata_fixed2_stanfit <- pnbd_extdata_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 200,
iter_sampling = 50,
seed = 4202,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 Iteration: 1 / 250 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 250 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 250 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 250 [ 0%] (Warmup)
## Chain 2 Iteration: 100 / 250 [ 40%] (Warmup)
## Chain 1 Iteration: 100 / 250 [ 40%] (Warmup)
## Chain 4 Iteration: 100 / 250 [ 40%] (Warmup)
## Chain 3 Iteration: 100 / 250 [ 40%] (Warmup)
## Chain 2 Iteration: 200 / 250 [ 80%] (Warmup)
## Chain 2 Iteration: 201 / 250 [ 80%] (Sampling)
## Chain 3 Iteration: 200 / 250 [ 80%] (Warmup)
## Chain 3 Iteration: 201 / 250 [ 80%] (Sampling)
## Chain 1 Iteration: 200 / 250 [ 80%] (Warmup)
## Chain 1 Iteration: 201 / 250 [ 80%] (Sampling)
## Chain 4 Iteration: 200 / 250 [ 80%] (Warmup)
## Chain 4 Iteration: 201 / 250 [ 80%] (Sampling)
## Chain 2 Iteration: 250 / 250 [100%] (Sampling)
## Chain 2 finished in 41.0 seconds.
## Chain 4 Iteration: 250 / 250 [100%] (Sampling)
## Chain 4 finished in 44.0 seconds.
## Chain 3 Iteration: 250 / 250 [100%] (Sampling)
## Chain 3 finished in 47.4 seconds.
## Chain 1 Iteration: 250 / 250 [100%] (Sampling)
## Chain 1 finished in 47.8 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 45.0 seconds.
## Total execution time: 48.0 seconds.
pnbd_extdata_fixed2_stanfit$summary()## # A tibble: 14,149 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -8.63e+4 -8.63e+4 78.6 76.6 -8.65e+4 -8.62e+4 1.16 21.2
## 2 lambda[1] 1.84e-1 1.77e-1 0.0539 0.0572 1.05e-1 2.64e-1 1.00 286.
## 3 lambda[2] 1.49e-1 1.21e-1 0.103 0.0760 2.90e-2 3.41e-1 0.997 249.
## 4 lambda[3] 1.12e-1 9.09e-2 0.0826 0.0591 2.76e-2 2.74e-1 1.00 248.
## 5 lambda[4] 7.52e-2 5.97e-2 0.0529 0.0432 1.73e-2 1.73e-1 1.01 210.
## 6 lambda[5] 1.81e-1 1.02e-1 0.232 0.117 7.37e-3 5.89e-1 0.999 325.
## 7 lambda[6] 2.32e-1 8.23e-2 0.371 0.0979 6.11e-3 9.36e-1 0.999 310.
## 8 lambda[7] 3.24e-1 3.15e-1 0.120 0.128 1.44e-1 5.38e-1 1.01 436.
## 9 lambda[8] 1.62e-1 8.32e-2 0.231 0.0959 2.30e-3 6.60e-1 1.02 150.
## 10 lambda[9] 1.99e-1 1.10e-1 0.276 0.131 7.06e-3 6.02e-1 1.01 240.
## # … with 14,139 more rows, and 1 more variable: ess_tail <dbl>
We have some basic HMC-based validity statistics we can check.
pnbd_extdata_fixed2_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed2-1.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed2-2.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed2-3.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed2-4.csvWarning: non-fatal error reading adaptation data
##
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## The following parameters had split R-hat greater than 1.05:
## lambda[1890], lambda[3387], lambda[3614], lambda[3865], mu[1145], mu[2117], mu[2243], mu[2560], mu[2697], mu[4046], p_alive[19], p_alive[79], p_alive[193], p_alive[227], p_alive[248], p_alive[303], p_alive[330], p_alive[375], p_alive[383], p_alive[438], p_alive[444], p_alive[474], p_alive[484], p_alive[494], p_alive[501], p_alive[503], p_alive[526], p_alive[536], p_alive[554], p_alive[586], p_alive[595], p_alive[597], p_alive[633], p_alive[641], p_alive[642], p_alive[649], p_alive[650], p_alive[661], p_alive[717], p_alive[718], p_alive[727], p_alive[747], p_alive[748], p_alive[759], p_alive[782], p_alive[788], p_alive[792], p_alive[794], p_alive[825], p_alive[847], p_alive[888], p_alive[911], p_alive[920], p_alive[940], p_alive[946], p_alive[957], p_alive[997], p_alive[1017], p_alive[1029], p_alive[1033], p_alive[1039], p_alive[1042], p_alive[1050], p_alive[1096], p_alive[1104], p_alive[1117], p_alive[1145], p_alive[1146], p_alive[1148], p_alive[1163], p_alive[1202], p_alive[1207], p_alive[1209], p_alive[1226], p_alive[1237], p_alive[1256], p_alive[1343], p_alive[1374], p_alive[1397], p_alive[1427], p_alive[1485], p_alive[1490], p_alive[1495], p_alive[1496], p_alive[1518], p_alive[1525], p_alive[1532], p_alive[1555], p_alive[1559], p_alive[1566], p_alive[1614], p_alive[1627], p_alive[1654], p_alive[1676], p_alive[1681], p_alive[1695], p_alive[1733], p_alive[1742], p_alive[1756], p_alive[1810], p_alive[1813], p_alive[1841], p_alive[1858], p_alive[1890], p_alive[1912], p_alive[1913], p_alive[1922], p_alive[1923], p_alive[1959], p_alive[1993], p_alive[2000], p_alive[2071], p_alive[2073], p_alive[2085], p_alive[2089], p_alive[2111], p_alive[2117], p_alive[2125], p_alive[2155], p_alive[2163], p_alive[2230], p_alive[2234], p_alive[2243], p_alive[2268], p_alive[2271], p_alive[2289], p_alive[2296], p_alive[2301], p_alive[2309], p_alive[2312], p_alive[2316], p_alive[2324], p_alive[2349], p_alive[2368], p_alive[2395], p_alive[2418], p_alive[2422], p_alive[2431], p_alive[2460], p_alive[2470], p_alive[2488], p_alive[2525], p_alive[2560], p_alive[2561], p_alive[2582], p_alive[2586], p_alive[2607], p_alive[2644], p_alive[2646], p_alive[2648], p_alive[2661], p_alive[2665], p_alive[2668], p_alive[2681], p_alive[2697], p_alive[2698], p_alive[2759], p_alive[2778], p_alive[2788], p_alive[2818], p_alive[2861], p_alive[2865], p_alive[2892], p_alive[2968], p_alive[2980], p_alive[2988], p_alive[3003], p_alive[3023], p_alive[3036], p_alive[3042], p_alive[3068], p_alive[3114], p_alive[3127], p_alive[3143], p_alive[3144], p_alive[3157], p_alive[3160], p_alive[3163], p_alive[3183], p_alive[3221], p_alive[3245], p_alive[3274], p_alive[3292], p_alive[3316], p_alive[3350], p_alive[3378], p_alive[3391], p_alive[3398], p_alive[3420], p_alive[3425], p_alive[3426], p_alive[3433], p_alive[3448], p_alive[3494], p_alive[3515], p_alive[3545], p_alive[3548], p_alive[3583], p_alive[3605], p_alive[3692], p_alive[3698], p_alive[3743], p_alive[3766], p_alive[3773], p_alive[3881], p_alive[3920], p_alive[3971], p_alive[3978], p_alive[4013], p_alive[4020], p_alive[4046], p_alive[4067], p_alive[4070], p_alive[4101], p_alive[4112], p_alive[4142], p_alive[4160], p_alive[4163], p_alive[4181], p_alive[4190], p_alive[4210], p_alive[4216], p_alive[4225], p_alive[4279], p_alive[4288], p_alive[4307], p_alive[4399], p_alive[4437], p_alive[4439], p_alive[4453], p_alive[4469], p_alive[4486], p_alive[4488], p_alive[4508], p_alive[4514], p_alive[4528], p_alive[4540], p_alive[4557], p_alive[4581], p_alive[4626], p_alive[4627], p_alive[4645], p_alive[4648], p_alive[4690], p_alive[4699], p_alive[4709], p_alive[4715]
## Such high values indicate incomplete mixing and biased estimation.
## You should consider regularizating your model with additional prior information or a more effective parameterization.
##
## Processing complete.
3.2 Visual Diagnostics of the Sample Validity
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_extdata_fixed2_stanfit$draws(inc_warmup = FALSE) %>%
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
theme(axis.text.x = element_text(size = 10))A common MCMC diagnostic is \(\hat{R}\) - which is a measure of the ‘similarity’ of the chains.
pnbd_extdata_fixed2_stanfit %>%
rhat(pars = c("lambda", "mu")) %>%
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the size of the sample from a statistical information point of view.
pnbd_extdata_fixed2_stanfit %>%
neff_ratio(pars = c("lambda", "mu")) %>%
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")Finally, we also want to look at autocorrelation in the chains for each parameter.
pnbd_extdata_fixed2_stanfit$draws() %>%
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Values")3.3 Validate the Alternate Prior Model
pnbd_extdata_fixed2_validation_tbl <- pnbd_extdata_fixed2_stanfit %>%
recover_types(btyd_fitdata_tbl) %>%
spread_draws(lambda[customer_id], mu[customer_id], p_alive[customer_id]) %>%
ungroup() %>%
select(
customer_id, draw_id = .draw, post_lambda = lambda, post_mu = mu, p_alive
)
pnbd_extdata_fixed2_validation_tbl %>% glimpse()## Rows: 943,200
## Columns: 5
## $ customer_id <chr> "12346", "12346", "12346", "12346", "12346", "12346", "123…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ post_lambda <dbl> 0.135532, 0.221596, 0.146430, 0.136745, 0.220450, 0.104887…
## $ post_mu <dbl> 0.008837510, 0.008832730, 0.002376910, 0.003333760, 0.0015…
## $ p_alive <dbl> 0.824880, 0.724790, 0.944504, 0.927721, 0.939766, 0.733008…
Having constructed our simulations inputs, we now generate our simulations.
precompute_dir <- "precompute/pnbd_extdata_fixed2"
precomputed_tbl <- dir_ls(precompute_dir) %>%
enframe(name = NULL, value = "sim_file") %>%
mutate(sim_file = sim_file %>% as.character())
pnbd_extdata_fixed2_validsims_lookup_tbl <- pnbd_extdata_fixed2_validation_tbl %>%
group_nest(customer_id, .key = "cust_params") %>%
mutate(
sim_file = glue(
"{precompute_dir}/sims_pnbd_extdata_fixed2_smalliter_{customer_id}.rds"
)
)
exec_tbl <- pnbd_extdata_fixed2_validsims_lookup_tbl %>%
anti_join(precomputed_tbl, by = "sim_file")
if(exec_tbl %>% nrow() > 0) {
exec_tbl %>%
mutate(
calc_file = future_map2_lgl(
sim_file, cust_params,
run_pnbd_simulations_chunk,
start_dttm = as.POSIXct("2011-04-01"),
end_dttm = as.POSIXct("2011-12-10"),
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = FALSE,
seed = 4202
),
.progress = TRUE
)
)
}
exec_tbl %>% glimpse()## Rows: 0
## Columns: 3
## $ customer_id <chr>
## $ cust_params <list<tibble[,4]>> list()
## $ sim_file <glue>
pnbd_extdata_fixed2_validsims_lookup_tbl %>% glimpse()## Rows: 4,716
## Columns: 3
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351", "123…
## $ cust_params <list<tibble[,4]>> [<tbl_df[200 x 4]>], [<tbl_df[200 x 4]>], [<t…
## $ sim_file <glue> "precompute/pnbd_extdata_fixed2/sims_pnbd_extdata_fixed2_…
We now load all the simulations into a file.
pnbd_extdata_fixed2_validsims_tbl <- pnbd_extdata_fixed2_validsims_lookup_tbl %>%
mutate(
data = map(sim_file, ~ .x %>% read_rds() %>% select(draw_id, sim_tnx_count, sim_tnx_last))
) %>%
select(customer_id, data) %>%
unnest(data)
pnbd_extdata_fixed2_validsims_tbl %>% glimpse()## Rows: 943,200
## Columns: 4
## $ customer_id <chr> "12346", "12346", "12346", "12346", "12346", "12346", "1…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ sim_tnx_count <int> 4, 4, 5, 2, 9, 2, 0, 3, 4, 4, 2, 8, 0, 5, 0, 0, 3, 6, 1,…
## $ sim_tnx_last <dttm> 2011-10-20 14:05:55, 2011-10-24 23:36:24, 2011-11-27 20…
tnx_data_tbl <- btyd_obs_stats_tbl %>%
semi_join(pnbd_extdata_fixed2_validsims_tbl, by = "customer_id")
obs_customer_count <- tnx_data_tbl %>% nrow()
obs_total_tnx_count <- tnx_data_tbl %>% pull(tnx_count) %>% sum()
pnbd_extdata_fixed2_tnx_simsumm_tbl <- pnbd_extdata_fixed2_validsims_tbl %>%
group_by(draw_id) %>%
summarise(
.groups = "drop",
sim_customer_count = length(sim_tnx_count[sim_tnx_count > 0]),
sim_total_tnx_count = sum(sim_tnx_count)
)
ggplot(pnbd_extdata_fixed2_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_customer_count), binwidth = 10) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Simulated Customers With Transactions",
y = "Frequency",
title = "Histogram of Count of Customers Transacted",
subtitle = "Observed Count in Red"
)ggplot(pnbd_extdata_fixed2_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_total_tnx_count), binwidth = 50) +
geom_vline(aes(xintercept = obs_total_tnx_count), colour = "red") +
labs(
x = "Simulated Transaction Count",
y = "Frequency",
title = "Histogram of Count of Total Transaction Count",
subtitle = "Observed Count in Red"
)3.4 Refit with Higher Iteration Samples
We now want to refit this model with a higher iteration count.
stan_modelname <- "pnbd_extdata_fixed2"
stanfit_prefix <- str_c("fit_", stan_modelname)
stan_data_lst <- btyd_fitdata_tbl %>%
select(customer_id, x, t_x, T_cal) %>%
compose_data(
lambda_mn = 0.50,
lambda_cv = 1.00,
mu_mn = 0.05,
mu_cv = 1.00,
)
pnbd_extdata_fixed2_stanfit <- pnbd_extdata_fixed_stanmodel$sample(
data = stan_data_lst,
chains = 4,
iter_warmup = 500,
iter_sampling = 500,
seed = 4203,
save_warmup = TRUE,
output_dir = stan_modeldir,
output_basename = stanfit_prefix,
)## Running MCMC with 4 chains, at most 8 in parallel...
##
## Chain 1 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 2 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 3 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 1 / 1000 [ 0%] (Warmup)
## Chain 4 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 2 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 3 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 1 Iteration: 100 / 1000 [ 10%] (Warmup)
## Chain 4 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 2 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 3 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 1 Iteration: 200 / 1000 [ 20%] (Warmup)
## Chain 4 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 2 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 3 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 1 Iteration: 300 / 1000 [ 30%] (Warmup)
## Chain 4 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 2 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 3 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 1 Iteration: 400 / 1000 [ 40%] (Warmup)
## Chain 4 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 4 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 2 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 2 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 3 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 3 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 1 Iteration: 500 / 1000 [ 50%] (Warmup)
## Chain 1 Iteration: 501 / 1000 [ 50%] (Sampling)
## Chain 4 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 2 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 3 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 1 Iteration: 600 / 1000 [ 60%] (Sampling)
## Chain 4 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 2 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 3 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 1 Iteration: 700 / 1000 [ 70%] (Sampling)
## Chain 4 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 2 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 3 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 1 Iteration: 800 / 1000 [ 80%] (Sampling)
## Chain 4 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 2 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 3 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 1 Iteration: 900 / 1000 [ 90%] (Sampling)
## Chain 4 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 4 finished in 116.2 seconds.
## Chain 2 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 2 finished in 118.3 seconds.
## Chain 3 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 3 finished in 119.8 seconds.
## Chain 1 Iteration: 1000 / 1000 [100%] (Sampling)
## Chain 1 finished in 122.0 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 119.1 seconds.
## Total execution time: 122.4 seconds.
pnbd_extdata_fixed2_stanfit$summary()## # A tibble: 14,149 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 lp__ -8.63e+4 -8.63e+4 76.7 76.5 -8.65e+4 -8.62e+4 1.00 692.
## 2 lambda[1] 1.77e-1 1.73e-1 0.0494 0.0489 1.03e-1 2.64e-1 1.00 1907.
## 3 lambda[2] 1.44e-1 1.24e-1 0.0905 0.0772 3.61e-2 3.20e-1 1.00 1659.
## 4 lambda[3] 1.11e-1 9.89e-2 0.0638 0.0594 3.05e-2 2.30e-1 0.999 1716.
## 5 lambda[4] 7.62e-2 6.44e-2 0.0484 0.0384 2.01e-2 1.69e-1 1.00 2011.
## 6 lambda[5] 1.78e-1 9.78e-2 0.234 0.108 6.28e-3 6.32e-1 1.00 1882.
## 7 lambda[6] 1.96e-1 9.55e-2 0.277 0.109 7.05e-3 7.43e-1 1.00 1571.
## 8 lambda[7] 3.17e-1 3.02e-1 0.118 0.114 1.52e-1 5.39e-1 1.00 2202.
## 9 lambda[8] 1.98e-1 9.89e-2 0.272 0.118 5.98e-3 7.09e-1 1.00 1613.
## 10 lambda[9] 2.02e-1 9.45e-2 0.276 0.117 5.14e-3 7.74e-1 1.00 2264.
## # … with 14,139 more rows, and 1 more variable: ess_tail <dbl>
We have some basic HMC-based validity statistics we can check.
pnbd_extdata_fixed2_stanfit$cmdstan_diagnose()## Processing csv files: /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed2-1.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed2-2.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed2-3.csvWarning: non-fatal error reading adaptation data
## , /home/rstudio/workshop/stan_models/fit_pnbd_extdata_fixed2-4.csvWarning: non-fatal error reading adaptation data
##
##
## Checking sampler transitions treedepth.
## Treedepth satisfactory for all transitions.
##
## Checking sampler transitions for divergences.
## No divergent transitions found.
##
## Checking E-BFMI - sampler transitions HMC potential energy.
## E-BFMI satisfactory.
##
## Effective sample size satisfactory.
##
## Split R-hat values satisfactory all parameters.
##
## Processing complete, no problems detected.
3.5 Visual Diagnostics of the Higher Iteration Sample
Now that we have a sample from the posterior distribution we need to create a few different visualisations of the diagnostics.
parameter_subset <- c(
"lambda[1]", "lambda[2]", "lambda[3]", "lambda[4]",
"mu[1]", "mu[2]", "mu[3]", "mu[4]"
)
pnbd_extdata_fixed2_stanfit$draws(inc_warmup = FALSE) %>%
mcmc_trace(pars = parameter_subset) +
expand_limits(y = 0) +
labs(
x = "Iteration",
y = "Value",
title = "Traceplot of Sample of Lambda and Mu Values"
) +
theme(axis.text.x = element_text(size = 10))A common MCMC diagnostic is \(\hat{R}\) - which is a measure of the ‘similarity’ of the chains.
pnbd_extdata_fixed2_stanfit %>%
rhat(pars = c("lambda", "mu")) %>%
mcmc_rhat() +
ggtitle("Plot of Parameter R-hat Values")Related to this quantity is the concept of effective sample size, \(N_{eff}\), an estimate of the size of the sample from a statistical information point of view.
pnbd_extdata_fixed2_stanfit %>%
neff_ratio(pars = c("lambda", "mu")) %>%
mcmc_neff() +
ggtitle("Plot of Parameter Effective Sample Sizes")Finally, we also want to look at autocorrelation in the chains for each parameter.
pnbd_extdata_fixed2_stanfit$draws() %>%
mcmc_acf(pars = parameter_subset) +
ggtitle("Autocorrelation Plot of Sample Values")3.6 Validate the Higher Iteration Model
We now want to revalidate this model, so we repeat our steps.
pnbd_extdata_fixed2_validation_tbl <- pnbd_extdata_fixed2_stanfit %>%
recover_types(btyd_fitdata_tbl) %>%
spread_draws(lambda[customer_id], mu[customer_id], p_alive[customer_id]) %>%
ungroup() %>%
select(
customer_id, draw_id = .draw, post_lambda = lambda, post_mu = mu, p_alive
)
pnbd_extdata_fixed2_validation_tbl %>% glimpse()## Rows: 9,432,000
## Columns: 5
## $ customer_id <chr> "12346", "12346", "12346", "12346", "12346", "12346", "123…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
## $ post_lambda <dbl> 0.177912, 0.192083, 0.189967, 0.146978, 0.136297, 0.232329…
## $ post_mu <dbl> 0.030414800, 0.001790260, 0.042978100, 0.014361600, 0.0156…
## $ p_alive <dbl> 0.471672, 0.943662, 0.347124, 0.721901, 0.716274, 0.736041…
Having constructed our simulations inputs, we now generate our simulations.
precompute_dir <- "precompute/pnbd_extdata_fixed2"
precomputed_tbl <- dir_ls(precompute_dir) %>%
enframe(name = NULL, value = "sim_file") %>%
mutate(sim_file = sim_file %>% as.character())
pnbd_extdata_fixed2_validsims_lookup_tbl <- pnbd_extdata_fixed2_validation_tbl %>%
group_nest(customer_id, .key = "cust_params") %>%
mutate(
sim_file = glue(
"{precompute_dir}/sims_pnbd_extdata_fixed2_highiter_{customer_id}.rds"
)
)
exec_tbl <- pnbd_extdata_fixed2_validsims_lookup_tbl %>%
anti_join(precomputed_tbl, by = "sim_file")
if(exec_tbl %>% nrow() > 0) {
exec_tbl %>%
mutate(
calc_file = future_map2_lgl(
sim_file, cust_params,
run_pnbd_simulations_chunk,
start_dttm = as.POSIXct("2011-04-01"),
end_dttm = as.POSIXct("2011-12-10"),
.options = furrr_options(
globals = c(
"calculate_event_times", "rgamma_mucv", "gamma_mucv2shaperate",
"generate_pnbd_validation_transactions"
),
packages = c("tidyverse", "fs"),
scheduling = FALSE,
seed = 4202
),
.progress = TRUE
)
)
}
exec_tbl %>% glimpse()## Rows: 0
## Columns: 3
## $ customer_id <chr>
## $ cust_params <list<tibble[,4]>> list()
## $ sim_file <glue>
pnbd_extdata_fixed2_validsims_lookup_tbl %>% glimpse()## Rows: 4,716
## Columns: 3
## $ customer_id <chr> "12346", "12347", "12348", "12349", "12350", "12351", "123…
## $ cust_params <list<tibble[,4]>> [<tbl_df[2000 x 4]>], [<tbl_df[2000 x 4]>], […
## $ sim_file <glue> "precompute/pnbd_extdata_fixed2/sims_pnbd_extdata_fixed2_…
We now load all the simulations into a file.
pnbd_extdata_fixed2_validsims_tbl <- pnbd_extdata_fixed2_validsims_lookup_tbl %>%
mutate(
data = map(sim_file, ~ .x %>% read_rds() %>% select(draw_id, sim_tnx_count, sim_tnx_last))
) %>%
select(customer_id, data) %>%
unnest(data)
pnbd_extdata_fixed2_validsims_tbl %>% glimpse()## Rows: 9,432,000
## Columns: 4
## $ customer_id <chr> "12346", "12346", "12346", "12346", "12346", "12346", "1…
## $ draw_id <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 1…
## $ sim_tnx_count <int> 0, 7, 0, 6, 2, 5, 0, 6, 0, 8, 1, 8, 0, 8, 2, 0, 0, 6, 0,…
## $ sim_tnx_last <dttm> NA, 2011-12-03 21:14:33, NA, 2011-12-01 03:19:22, 2011-…
tnx_data_tbl <- btyd_obs_stats_tbl %>%
semi_join(pnbd_extdata_fixed2_validsims_tbl, by = "customer_id")
obs_customer_count <- tnx_data_tbl %>% nrow()
obs_total_tnx_count <- tnx_data_tbl %>% pull(tnx_count) %>% sum()
pnbd_extdata_fixed2_tnx_simsumm_tbl <- pnbd_extdata_fixed2_validsims_tbl %>%
group_by(draw_id) %>%
summarise(
.groups = "drop",
sim_customer_count = length(sim_tnx_count[sim_tnx_count > 0]),
sim_total_tnx_count = sum(sim_tnx_count)
)
ggplot(pnbd_extdata_fixed2_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_customer_count), binwidth = 10) +
geom_vline(aes(xintercept = obs_customer_count), colour = "red") +
labs(
x = "Simulated Customers With Transactions",
y = "Frequency",
title = "Histogram of Count of Customers Transacted",
subtitle = "Observed Count in Red"
)ggplot(pnbd_extdata_fixed2_tnx_simsumm_tbl) +
geom_histogram(aes(x = sim_total_tnx_count), binwidth = 50) +
geom_vline(aes(xintercept = obs_total_tnx_count), colour = "red") +
labs(
x = "Simulated Transaction Count",
y = "Frequency",
title = "Histogram of Count of Total Transaction Count",
subtitle = "Observed Count in Red"
)4 R Environment
options(width = 120L)
sessioninfo::session_info()## ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
## setting value
## version R version 4.2.0 (2022-04-22)
## os Ubuntu 20.04.5 LTS
## system x86_64, linux-gnu
## ui RStudio
## language (EN)
## collate en_US.UTF-8
## ctype en_US.UTF-8
## tz Etc/UTC
## date 2022-10-14
## rstudio 2022.02.3+492 Prairie Trillium (server)
## pandoc 2.17.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/ (via rmarkdown)
##
## ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## abind 1.4-5 2016-07-21 [1] RSPM (R 4.2.0)
## arrayhelpers 1.1-0 2020-02-04 [1] RSPM (R 4.2.0)
## assertthat 0.2.1 2019-03-21 [1] RSPM (R 4.2.0)
## backports 1.4.1 2021-12-13 [1] RSPM (R 4.2.0)
## base64enc 0.1-3 2015-07-28 [1] RSPM (R 4.2.0)
## bayesplot * 1.9.0 2022-03-10 [1] RSPM (R 4.2.0)
## bookdown 0.27 2022-06-14 [1] RSPM (R 4.2.0)
## boot 1.3-28 2021-05-03 [2] CRAN (R 4.2.0)
## bridgesampling 1.1-2 2021-04-16 [1] RSPM (R 4.2.0)
## brms * 2.17.0 2022-09-26 [1] Github (paul-buerkner/brms@a43937c)
## Brobdingnag 1.2-7 2022-02-03 [1] RSPM (R 4.2.0)
## broom 0.8.0 2022-04-13 [1] RSPM (R 4.2.0)
## bslib 0.3.1 2021-10-06 [1] RSPM (R 4.2.0)
## cachem 1.0.6 2021-08-19 [1] RSPM (R 4.2.0)
## callr 3.7.0 2021-04-20 [1] RSPM (R 4.2.0)
## cellranger 1.1.0 2016-07-27 [1] RSPM (R 4.2.0)
## checkmate 2.1.0 2022-04-21 [1] RSPM (R 4.2.0)
## cli 3.3.0 2022-04-25 [1] RSPM (R 4.2.0)
## cmdstanr * 0.5.3 2022-09-26 [1] Github (stan-dev/cmdstanr@22b391e)
## coda 0.19-4 2020-09-30 [1] RSPM (R 4.2.0)
## codetools 0.2-18 2020-11-04 [2] CRAN (R 4.2.0)
## colorspace 2.0-3 2022-02-21 [1] RSPM (R 4.2.0)
## colourpicker 1.1.1 2021-10-04 [1] RSPM (R 4.2.0)
## conflicted * 1.1.0 2021-11-26 [1] RSPM (R 4.2.0)
## cowplot * 1.1.1 2020-12-30 [1] RSPM (R 4.2.0)
## crayon 1.5.1 2022-03-26 [1] RSPM (R 4.2.0)
## crosstalk 1.2.0 2021-11-04 [1] RSPM (R 4.2.0)
## curl 4.3.2 2021-06-23 [1] RSPM (R 4.2.0)
## DBI 1.1.3 2022-06-18 [1] RSPM (R 4.2.0)
## dbplyr 2.2.0 2022-06-05 [1] RSPM (R 4.2.0)
## digest 0.6.29 2021-12-01 [1] RSPM (R 4.2.0)
## directlabels * 2021.1.13 2021-01-16 [1] RSPM (R 4.2.0)
## distributional 0.3.0 2022-01-05 [1] RSPM (R 4.2.0)
## dplyr * 1.0.9 2022-04-28 [1] RSPM (R 4.2.0)
## DT 0.23 2022-05-10 [1] RSPM (R 4.2.0)
## dygraphs 1.1.1.6 2018-07-11 [1] RSPM (R 4.2.0)
## ellipsis 0.3.2 2021-04-29 [1] RSPM (R 4.2.0)
## evaluate 0.15 2022-02-18 [1] RSPM (R 4.2.0)
## fansi 1.0.3 2022-03-24 [1] RSPM (R 4.2.0)
## farver 2.1.0 2021-02-28 [1] RSPM (R 4.2.0)
## fastmap 1.1.0 2021-01-25 [1] RSPM (R 4.2.0)
## forcats * 0.5.1 2021-01-27 [1] RSPM (R 4.2.0)
## fs * 1.5.2 2021-12-08 [1] RSPM (R 4.2.0)
## furrr * 0.3.0 2022-05-04 [1] RSPM (R 4.2.0)
## future * 1.26.1 2022-05-27 [1] RSPM (R 4.2.0)
## gamm4 0.2-6 2020-04-03 [1] RSPM (R 4.2.0)
## generics 0.1.2 2022-01-31 [1] RSPM (R 4.2.0)
## ggdist 3.1.1 2022-02-27 [1] RSPM (R 4.2.0)
## ggplot2 * 3.3.6 2022-05-03 [1] RSPM (R 4.2.0)
## ggridges 0.5.3 2021-01-08 [1] RSPM (R 4.2.0)
## globals 0.15.0 2022-05-09 [1] RSPM (R 4.2.0)
## glue * 1.6.2 2022-02-24 [1] RSPM (R 4.2.0)
## gridExtra 2.3 2017-09-09 [1] RSPM (R 4.2.0)
## gtable 0.3.0 2019-03-25 [1] RSPM (R 4.2.0)
## gtools 3.9.2.2 2022-06-13 [1] RSPM (R 4.2.0)
## haven 2.5.0 2022-04-15 [1] RSPM (R 4.2.0)
## highr 0.9 2021-04-16 [1] RSPM (R 4.2.0)
## hms 1.1.1 2021-09-26 [1] RSPM (R 4.2.0)
## htmltools 0.5.2 2021-08-25 [1] RSPM (R 4.2.0)
## htmlwidgets 1.5.4 2021-09-08 [1] RSPM (R 4.2.0)
## httpuv 1.6.5 2022-01-05 [1] RSPM (R 4.2.0)
## httr 1.4.3 2022-05-04 [1] RSPM (R 4.2.0)
## igraph 1.3.2 2022-06-13 [1] RSPM (R 4.2.0)
## inline 0.3.19 2021-05-31 [1] RSPM (R 4.2.0)
## jquerylib 0.1.4 2021-04-26 [1] RSPM (R 4.2.0)
## jsonlite 1.8.0 2022-02-22 [1] RSPM (R 4.2.0)
## knitr 1.39 2022-04-26 [1] RSPM (R 4.2.0)
## labeling 0.4.2 2020-10-20 [1] RSPM (R 4.2.0)
## later 1.3.0 2021-08-18 [1] RSPM (R 4.2.0)
## lattice 0.20-45 2021-09-22 [2] CRAN (R 4.2.0)
## lifecycle 1.0.1 2021-09-24 [1] RSPM (R 4.2.0)
## listenv 0.8.0 2019-12-05 [1] RSPM (R 4.2.0)
## lme4 1.1-29 2022-04-07 [1] RSPM (R 4.2.0)
## loo 2.5.1 2022-03-24 [1] RSPM (R 4.2.0)
## lubridate 1.8.0 2021-10-07 [1] RSPM (R 4.2.0)
## magrittr * 2.0.3 2022-03-30 [1] RSPM (R 4.2.0)
## markdown 1.1 2019-08-07 [1] RSPM (R 4.2.0)
## MASS 7.3-56 2022-03-23 [2] CRAN (R 4.2.0)
## Matrix 1.4-1 2022-03-23 [2] CRAN (R 4.2.0)
## matrixStats 0.62.0 2022-04-19 [1] RSPM (R 4.2.0)
## memoise 2.0.1 2021-11-26 [1] RSPM (R 4.2.0)
## mgcv 1.8-40 2022-03-29 [2] CRAN (R 4.2.0)
## mime 0.12 2021-09-28 [1] RSPM (R 4.2.0)
## miniUI 0.1.1.1 2018-05-18 [1] RSPM (R 4.2.0)
## minqa 1.2.4 2014-10-09 [1] RSPM (R 4.2.0)
## modelr 0.1.8 2020-05-19 [1] RSPM (R 4.2.0)
## munsell 0.5.0 2018-06-12 [1] RSPM (R 4.2.0)
## mvtnorm 1.1-3 2021-10-08 [1] RSPM (R 4.2.0)
## nlme 3.1-157 2022-03-25 [2] CRAN (R 4.2.0)
## nloptr 2.0.3 2022-05-26 [1] RSPM (R 4.2.0)
## parallelly 1.32.0 2022-06-07 [1] RSPM (R 4.2.0)
## pillar 1.7.0 2022-02-01 [1] RSPM (R 4.2.0)
## pkgbuild 1.3.1 2021-12-20 [1] RSPM (R 4.2.0)
## pkgconfig 2.0.3 2019-09-22 [1] RSPM (R 4.2.0)
## plyr 1.8.7 2022-03-24 [1] RSPM (R 4.2.0)
## posterior * 1.2.2 2022-06-09 [1] RSPM (R 4.2.0)
## prettyunits 1.1.1 2020-01-24 [1] RSPM (R 4.2.0)
## processx 3.6.1 2022-06-17 [1] RSPM (R 4.2.0)
## projpred 2.1.2 2022-05-13 [1] RSPM (R 4.2.0)
## promises 1.2.0.1 2021-02-11 [1] RSPM (R 4.2.0)
## ps 1.7.1 2022-06-18 [1] RSPM (R 4.2.0)
## purrr * 0.3.4 2020-04-17 [1] RSPM (R 4.2.0)
## quadprog 1.5-8 2019-11-20 [1] RSPM (R 4.2.0)
## R6 2.5.1 2021-08-19 [1] RSPM (R 4.2.0)
## Rcpp * 1.0.8.3 2022-03-17 [1] RSPM (R 4.2.0)
## RcppParallel 5.1.5 2022-01-05 [1] RSPM (R 4.2.0)
## readr * 2.1.2 2022-01-30 [1] RSPM (R 4.2.0)
## readxl 1.4.0 2022-03-28 [1] RSPM (R 4.2.0)
## reprex 2.0.1 2021-08-05 [1] RSPM (R 4.2.0)
## reshape2 1.4.4 2020-04-09 [1] RSPM (R 4.2.0)
## rlang * 1.0.2 2022-03-04 [1] RSPM (R 4.2.0)
## rmarkdown 2.14 2022-04-25 [1] RSPM (R 4.2.0)
## rmdformats 1.0.4 2022-05-17 [1] RSPM (R 4.2.0)
## rstan 2.26.13 2022-09-26 [1] local
## rstantools 2.2.0 2022-04-08 [1] RSPM (R 4.2.0)
## rstudioapi 0.13 2020-11-12 [1] RSPM (R 4.2.0)
## rvest 1.0.2 2021-10-16 [1] RSPM (R 4.2.0)
## sass 0.4.1 2022-03-23 [1] RSPM (R 4.2.0)
## scales * 1.2.0 2022-04-13 [1] RSPM (R 4.2.0)
## sessioninfo 1.2.2 2021-12-06 [1] RSPM (R 4.2.0)
## shiny 1.7.1 2021-10-02 [1] RSPM (R 4.2.0)
## shinyjs 2.1.0 2021-12-23 [1] RSPM (R 4.2.0)
## shinystan 2.6.0 2022-03-03 [1] RSPM (R 4.2.0)
## shinythemes 1.2.0 2021-01-25 [1] RSPM (R 4.2.0)
## StanHeaders 2.26.13 2022-09-26 [1] local
## stringi 1.7.6 2021-11-29 [1] RSPM (R 4.2.0)
## stringr * 1.4.0 2019-02-10 [1] RSPM (R 4.2.0)
## svUnit 1.0.6 2021-04-19 [1] RSPM (R 4.2.0)
## tensorA 0.36.2 2020-11-19 [1] RSPM (R 4.2.0)
## threejs 0.3.3 2020-01-21 [1] RSPM (R 4.2.0)
## tibble * 3.1.7 2022-05-03 [1] RSPM (R 4.2.0)
## tidybayes * 3.0.2.9000 2022-09-26 [1] Github (mjskay/tidybayes@1efbdef)
## tidyr * 1.2.0 2022-02-01 [1] RSPM (R 4.2.0)
## tidyselect 1.1.2 2022-02-21 [1] RSPM (R 4.2.0)
## tidyverse * 1.3.1 2021-04-15 [1] RSPM (R 4.2.0)
## tzdb 0.3.0 2022-03-28 [1] RSPM (R 4.2.0)
## utf8 1.2.2 2021-07-24 [1] RSPM (R 4.2.0)
## V8 4.2.0 2022-05-14 [1] RSPM (R 4.2.0)
## vctrs 0.4.1 2022-04-13 [1] RSPM (R 4.2.0)
## withr 2.5.0 2022-03-03 [1] RSPM (R 4.2.0)
## xfun 0.31 2022-05-10 [1] RSPM (R 4.2.0)
## xml2 1.3.3 2021-11-30 [1] RSPM (R 4.2.0)
## xtable 1.8-4 2019-04-21 [1] RSPM (R 4.2.0)
## xts 0.12.1 2020-09-09 [1] RSPM (R 4.2.0)
## yaml 2.3.5 2022-02-21 [1] RSPM (R 4.2.0)
## zoo 1.8-10 2022-04-15 [1] RSPM (R 4.2.0)
##
## [1] /usr/local/lib/R/site-library
## [2] /usr/local/lib/R/library
##
## ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
options(width = 80L)